{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d1982d63",
   "metadata": {},
   "source": [
    "\n",
    "# Clinical Study Biostatistics: T-tests, ANOVA, and Regression Analysis\n",
    "\n",
    "This notebook demonstrates hypothesis testing, ANOVA, and regression analysis techniques using a hypothetical dataset. Links to the dataset are provided for reproducibility.\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "326e1437",
   "metadata": {},
   "source": [
    "\n",
    "## Dataset Information and Download Links\n",
    "\n",
    "The examples in this notebook use a diabetes dataset. You can download the dataset from the following sources:\n",
    "\n",
    "1. **Kaggle:**\n",
    "   - [Diabetes Dataset - Kaggle](https://www.kaggle.com/datasets/mathchi/diabetes-data)\n",
    "   - This dataset provides detailed patient data for diabetes analysis.\n",
    "\n",
    "### Dataset Attributes\n",
    "\n",
    "- **Pregnancies**: Number of pregnancies.\n",
    "- **Glucose**: Plasma glucose concentration.\n",
    "- **BloodPressure**: Diastolic blood pressure (mm Hg).\n",
    "- **SkinThickness**: Triceps skinfold thickness (mm).\n",
    "- **Insulin**: 2-Hour serum insulin (mu U/ml).\n",
    "- **BMI**: Body mass index (weight in kg/(height in m)^2).\n",
    "- **DiabetesPedigreeFunction**: Diabetes pedigree function.\n",
    "- **Age**: Age of the patient.\n",
    "- **Outcome**: Class variable (0 = non-diabetic, 1 = diabetic).\n",
    "\n",
    "### Usage Notes\n",
    "\n",
    "- Ensure the dataset is preprocessed (handle missing values and normalize if required).\n",
    "- Refer to the [dataset documentation](https://www.kaggle.com/datasets/mathchi/diabetes-data) for more details.\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7655c8cf",
   "metadata": {},
   "source": [
    "\n",
    "## Two-Sample T-Test: Comparing HbA1c Between Diabetic and Non-Diabetic Groups\n",
    "\n",
    "A t-test is used to compare the mean HbA1c levels between diabetic and non-diabetic patients.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b182263d",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from scipy.stats import ttest_ind\n",
    "\n",
    "# Load the dataset (replace with your file path)\n",
    "data = pd.read_csv(r'C:\\Path\\to\\diabetes.csv')\n",
    "\n",
    "# Separate diabetic and non-diabetic groups\n",
    "diabetic = data[data['Outcome'] == 1]\n",
    "non_diabetic = data[data['Outcome'] == 0]\n",
    "\n",
    "# Perform t-test for HbA1c (using 'Glucose' as a proxy for HbA1c)\n",
    "t_statistic, p_value = ttest_ind(diabetic['Glucose'], non_diabetic['Glucose'])\n",
    "print(f\"t-statistic: {t_statistic}\")\n",
    "print(f\"p-value: {p_value}\")\n",
    "\n",
    "# Plot mean glucose levels with error bars\n",
    "means = [diabetic['Glucose'].mean(), non_diabetic['Glucose'].mean()]\n",
    "stds = [diabetic['Glucose'].std(), non_diabetic['Glucose'].std()]\n",
    "labels = ['Diabetic', 'Non-Diabetic']\n",
    "plt.bar(labels, means, yerr=stds, capsize=5)\n",
    "plt.ylabel('Mean Glucose Level')\n",
    "plt.title('Mean Glucose Levels with Error Bars')\n",
    "plt.show()\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4f0fe29",
   "metadata": {},
   "source": [
    "\n",
    "## ANOVA: HbA1c by BMI Categories\n",
    "\n",
    "ANOVA is used to evaluate differences in mean HbA1c levels across BMI categories.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6b62ecd2",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from scipy.stats import f_oneway\n",
    "\n",
    "# Create BMI categories\n",
    "def bmi_category(bmi):\n",
    "    if bmi < 25:\n",
    "        return 'Normal'\n",
    "    elif 25 <= bmi < 30:\n",
    "        return 'Overweight'\n",
    "    else:\n",
    "        return 'Obese'\n",
    "\n",
    "data['BMI_Category'] = data['BMI'].apply(bmi_category)\n",
    "\n",
    "# Perform ANOVA\n",
    "fvalue, pvalue = f_oneway(\n",
    "    data[data['BMI_Category'] == 'Normal']['Glucose'],\n",
    "    data[data['BMI_Category'] == 'Overweight']['Glucose'],\n",
    "    data[data['BMI_Category'] == 'Obese']['Glucose']\n",
    ")\n",
    "\n",
    "print(f\"F-value: {fvalue}\")\n",
    "print(f\"P-value: {pvalue}\")\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9566c0c",
   "metadata": {},
   "source": [
    "\n",
    "## Logistic Regression: Predicting Diabetes Using Clinical Variables\n",
    "\n",
    "Logistic regression is used to predict diabetes based on glucose, BMI, and age.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2909a5b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import statsmodels.api as sm\n",
    "\n",
    "# Prepare data for logistic regression\n",
    "data['Intercept'] = 1\n",
    "predictors = ['Glucose', 'BMI', 'Age', 'Intercept']\n",
    "logit_model = sm.Logit(data['Outcome'], data[predictors]).fit()\n",
    "\n",
    "# Print logistic regression summary\n",
    "print(logit_model.summary())\n",
    "        "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
